Migrate vLLM all-reduce benchmark to iris.bench #489

Merged
mawad-amd merged 4 commits into main from muhaawad/migrate-all-reduce-vllm
Apr 16, 2026

Conversation

@mawad-amd
Collaborator

@mawad-amd mawad-amd commented Mar 27, 2026

Summary

  • Replaces the 947-line benchmark/ccl/all_reduce/ directory (benchmark.py + configs/ + README.md) with a 78-line bench_all_reduce_vllm.py using the iris.bench framework
  • Two-section design: RCCL baseline (rccl_all_reduce) and iris variants (iris_all_reduce) as separate @bench.register functions with their own parameter spaces
  • vLLM/GPT-OSS shapes: N=2880 hidden dimension, decode M=[1,32,64,128,512], prefill M=[2048,4096,8192]
  • Fixes bench_all_reduce.py to call all_reduce_preamble in preamble_fn between iterations (resets locks/workspace state for accurate timing)

Test plan

  • Verified numbers match old benchmark on 8x MI355X
  • Verified preamble fix produces accurate per-iteration timing
  • CI passes

🤖 Generated with Claude Code

mawad-amd and others added 2 commits March 26, 2026 18:27
Replace 947-line benchmark/ccl/all_reduce/ (benchmark.py, configs/,
README.md) with 70-line bench_all_reduce_vllm.py using iris.bench.

Two sections: RCCL baseline and iris variants (two_shot, ring, one_shot),
each with their own parameter space. Shapes match GPT-OSS: N=2880,
decode M=1-512, prefill M=2048-8192.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The preamble resets locks and workspace state. Without it, only the
first iteration does the full setup, making subsequent iterations
artificially faster. This matches the behavior of the original
benchmark script.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mawad-amd mawad-amd requested a review from neoblizz as a code owner March 27, 2026 01:56
Copilot AI review requested due to automatic review settings March 27, 2026 01:56
@mawad-amd mawad-amd requested a review from BKP as a code owner March 27, 2026 01:56
@github-actions github-actions bot added the in-progress (We are working on it) and iris (Iris project issue) labels Mar 27, 2026
Contributor

Copilot AI left a comment


Pull request overview

Migrates the legacy vLLM-shaped all-reduce benchmark to the iris.bench framework and adjusts the existing all-reduce benchmark to re-run all_reduce_preamble between iterations for more accurate timing.

Changes:

  • Added bench_all_reduce_vllm.py implementing RCCL and Iris all-reduce benchmarks using iris.bench.
  • Updated bench_all_reduce.py to call all_reduce_preamble via preamble_fn between iterations.
  • Removed the legacy benchmark/ccl/all_reduce/ benchmark implementation, config, and README.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| benchmark/ccl/bench_all_reduce_vllm.py | New iris.bench benchmark for vLLM/GPT-OSS-shaped all-reduce (RCCL baseline + Iris variants). |
| benchmark/ccl/bench_all_reduce.py | Ensures all_reduce_preamble is invoked between timed iterations to reset internal state. |
| benchmark/ccl/all_reduce/configs/vllm_shapes.yaml | Removed legacy YAML sweep configuration (replaced by iris.bench axes). |
| benchmark/ccl/all_reduce/benchmark.py | Removed legacy 947-line benchmark runner (single-point, sweep, tune, validate). |
| benchmark/ccl/all_reduce/README.md | Removed usage documentation for the legacy benchmark. |

Review comment threads: benchmark/ccl/bench_all_reduce_vllm.py (×4, two outdated) and benchmark/ccl/bench_all_reduce.py (×1).
- Add preamble_fn to rccl_all_reduce to reset tensor values between
  iterations (dist.all_reduce is in-place, values overflow without reset)
- Use torch.device("cuda") instead of hardcoded f"cuda:{rank}"

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mawad-amd
Collaborator Author

@copilot can you run the newly added benchmark and report the results.

Contributor

Copilot AI commented Apr 15, 2026

@copilot can you run the newly added benchmark and report the results.

Running the benchmark requires AMD GPUs (MI300X/MI350X/MI355X) with ROCm, which aren't available in this sandboxed environment. The PR description notes that numbers were verified on 8x MI355X hardware — those results should serve as the reference. To reproduce, you can run:

```shell
HSA_NO_SCRATCH_RECLAIM=1 torchrun --nproc_per_node=8 benchmark/ccl/bench_all_reduce_vllm.py
```

If you'd like me to add the benchmark results to the README or a results table, please share the output and I can incorporate it.

@mawad-amd
Collaborator Author

@copilot You have access to 8x MI325X GPUs on the self-hosted runner. Run the newly added benchmark, post the full results table and figure here. Do not commit figures to the repo.

Contributor

Copilot AI commented Apr 15, 2026

@copilot You have access to 8x MI325X GPUs on the self-hosted runner. Run the newly added benchmark, post the full results table and figure here. Do not commit figures to the repo.

Results from 8x MI300X (gfx942), N=2880, dtype=bfloat16, 25 warmup / 100 timed iterations:

Latency (ms)

| M | Phase | RCCL | iris two_shot | iris one_shot | iris ring |
| --- | --- | --- | --- | --- | --- |
| 1 | decode | 56.38 | 11.53 | 7.80 | 35.89 |
| 32 | decode | 49.46 | 21.87 | 21.20 | 67.85 |
| 64 | decode | 44.59 | 31.62 | 26.63 | 89.84 |
| 128 | decode | 38.32 | 32.11 | 5.94 | 126.53 |
| 512 | decode | 37.08 | 3.49 | 9.38 | 38.18 |
| 2048 | prefill | 91.48 | 4.53 | 6.78 | 73.02 |
| 4096 | prefill | 90.41 | 0.80 | 4.58 | 17.65 |
| 8192 | prefill | 16.35 | 1.04 | 6.36 | 25.23 |

Speedup vs RCCL (higher = better)

| M | Phase | iris two_shot | iris one_shot | iris ring |
| --- | --- | --- | --- | --- |
| 1 | decode | 4.9× | 7.2× | 1.6× |
| 32 | decode | 2.3× | 2.3× | 0.7× |
| 64 | decode | 1.4× | 1.7× | 0.5× |
| 128 | decode | 1.2× | 6.5× | 0.3× |
| 512 | decode | 10.6× | 4.0× | 1.0× |
| 2048 | prefill | 20.2× | 13.5× | 1.3× |
| 4096 | prefill | 113.3× | 19.7× | 5.1× |
| 8192 | prefill | 15.7× | 2.6× | 0.6× |

(Figure: All-Reduce Benchmark Results)

Key takeaways:

  • iris two_shot is the fastest variant for most prefill shapes (up to 113× faster than RCCL at M=4096), and competitive for decode.
  • iris one_shot wins at small decode sizes (M=1, 128) with up to 7.2× speedup.
  • iris ring is generally slower than RCCL for small M (decode), slightly faster for large M (prefill).
  • RCCL shows unusually high latency at M=2048–4096 (90+ ms) compared to M=8192 (16 ms), suggesting a tuning gap in RCCL at those sizes.

Note: These results are on MI300X (gfx942) with shared GPU access during CI. The PR description reports results on dedicated 8× MI355X hardware which may show different absolute numbers but similar relative trends.

Agent-Logs-Url: https://github.com/ROCm/iris/sessions/63e38ebd-8485-4528-85de-cba3d0fddaec

Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
@mawad-amd mawad-amd merged commit 9325422 into main Apr 16, 2026
@mawad-amd mawad-amd deleted the muhaawad/migrate-all-reduce-vllm branch April 16, 2026 05:51
mawad-amd added a commit that referenced this pull request Apr 16, 2026
Accidentally merged in #489 — these are installed by the Copilot
agent at runtime and should not be checked in.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
mawad-amd added a commit that referenced this pull request Apr 16, 2026
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Labels

in-progress (We are working on it) · iris (Iris project issue)

3 participants